python pandas groupby complex calculations - python

I have a dataframe df with columns a, b, c, d. I want to group the data by a and run some calculations. Here is the code for these calculations in R. My main question is how to do the same in pandas.
library(dplyr)
df %>%
group_by(a) %>%
summarise(mean_b = mean(b),
qt95 = quantile(b, .95),
diff_b_c = max(b-c),
std_b_d = sd(b)-sd(d)) %>%
ungroup()
This example is synthetic; I just want to understand the pandas syntax.

I believe you need custom function with GroupBy.apply:
def f(x):
    mean_b = x.b.mean()
    qt95 = x.b.quantile(.95)
    diff_b_c = (x.b - x.c).max()
    std_b_d = x.b.std() - x.d.std()
    cols = ['mean_b','qt95','diff_b_c','std_b_d']
    return pd.Series([mean_b, qt95, diff_b_c, std_b_d], index=cols)

df1 = df.groupby('a').apply(f)
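To see what this custom function returns, here is a small self-contained run. The sample values are made up for illustration; only the column names a, b, c, d come from the question. Selecting the b, c, d columns before calling apply is an assumption of this sketch, made so that the grouping column is not passed into the function.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "a": ["x", "x", "x", "y", "y"],
    "b": [1.0, 2.0, 3.0, 10.0, 20.0],
    "c": [0.5, 0.5, 0.5, 5.0, 5.0],
    "d": [2.0, 4.0, 6.0, 1.0, 1.0],
})

def f(x):
    # One Series per group; its index becomes the output columns.
    return pd.Series({
        "mean_b": x["b"].mean(),
        "qt95": x["b"].quantile(0.95),
        "diff_b_c": (x["b"] - x["c"]).max(),
        "std_b_d": x["b"].std() - x["d"].std(),
    })

# Result: one row per value of "a", one column per statistic.
df1 = df.groupby("a")[["b", "c", "d"]].apply(f)
print(df1)
```

Like the R version, std() here is the sample standard deviation (ddof=1), which matches R's sd().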

Related

I want to get grouped counts and percentages using pandas

The table shows the count and percentage of cycle rides, grouped by membership type (casual, member).
What I have done in R:
big_frame %>%
group_by(member_casual) %>%
summarise(count = length(ride_id),
'%' = round((length(ride_id) / nrow(big_frame)) * 100, digit=2))
This is the best I've come up with in pandas, but I feel like there should be a better way:
member_casual_count = (
big_frame
.filter(['member_casual'])
.value_counts(normalize=True).mul(100).round(2)
.reset_index(name='percentage')
)
member_casual_count['count'] = (
big_frame
.filter(['member_casual'])
.value_counts()
.tolist()
)
member_casual_count
Thank you in advance
In R, you should be doing something like this:
big_frame %>%
count(member_casual) %>%
mutate(perc = n/sum(n))
In python, you can achieve the same like this:
(
big_frame
.groupby("member_casual")
.size()
.to_frame('ct')
.assign(n = lambda df: df.ct/df.ct.sum())
)
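As a sketch of how this could produce both the count and a rounded percentage, as in the question's R code, the following uses made-up rides. The column names ride_id and member_casual come from the question; the output names count and percentage are chosen here for illustration.

```python
import pandas as pd

# Made-up rides; the column names follow the question.
big_frame = pd.DataFrame({
    "ride_id": range(8),
    "member_casual": ["member"] * 5 + ["casual"] * 3,
})

summary = (
    big_frame
    .groupby("member_casual")
    .agg(count=("ride_id", "size"))  # rows per membership type
    .assign(percentage=lambda d: (d["count"] / d["count"].sum() * 100).round(2))
    .reset_index()
)
print(summary)
```

Dividing by the sum of the counts plays the role of nrow(big_frame) in the R version, since every row falls into exactly one group.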

How to aggregate a dataframe then transpose it with Pandas

I'm trying to achieve this kind of transformation with Pandas.
I made this code but unfortunately it doesn't give the result I'm searching for.
CODE :
import pandas as pd
df = pd.read_csv('file.csv', delimiter=';')
df = df.count().reset_index().T.reset_index()
df.columns = df.iloc[0]
df = df[1:]
df
RESULT :
Do you have any suggestions? Any help will be appreciated.
First create helper columns that flag the nonOK tests, then use named aggregation: size to count the IDs, sum for the Values column, and sum again to count the True values in the helper columns. Finally, add the two test counts together:
df = (df.assign(NumberOfTest1 = df['Test one'].eq('nonOK'),
NumberOfTest2 = df['Test two'].eq('nonOK'))
.groupby('Category', as_index=False)
.agg(NumberOfID = ('ID','size'),
Values = ('Values','sum'),
NumberOfTest1 = ('NumberOfTest1','sum'),
NumberOfTest2 = ('NumberOfTest2','sum'))
.assign(TotalTest = lambda x: x['NumberOfTest1'] + x['NumberOfTest2']))
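The original input table was an image and is not available, so the following run uses made-up rows. The column names ID, Category, Values, Test one, and Test two are taken from the answer's code; all values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; only the column names come from the answer above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Category": ["A", "A", "B", "B", "B"],
    "Values": [10, 20, 5, 5, 5],
    "Test one": ["OK", "nonOK", "nonOK", "OK", "nonOK"],
    "Test two": ["nonOK", "nonOK", "OK", "OK", "OK"],
})

out = (
    df.assign(
        # Boolean flags: True where the test result is "nonOK".
        NumberOfTest1=df["Test one"].eq("nonOK"),
        NumberOfTest2=df["Test two"].eq("nonOK"),
    )
    .groupby("Category", as_index=False)
    .agg(
        NumberOfID=("ID", "size"),
        Values=("Values", "sum"),
        NumberOfTest1=("NumberOfTest1", "sum"),  # summing booleans counts the Trues
        NumberOfTest2=("NumberOfTest2", "sum"),
    )
    .assign(TotalTest=lambda x: x["NumberOfTest1"] + x["NumberOfTest2"])
)
print(out)
```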

How to merge multiple columns with same names in a dataframe

I have the following dataframe as below:
df = pd.DataFrame({'Field':'FAPERF',
'Form':'LIVERID',
'Folder':'ALL',
'Logline':'9',
'Data':'Yes',
'Data':'Blank',
'Data':'No',
'Logline':'10'})
I need this dataframe:
df = pd.DataFrame({'Field':['FAPERF','FAPERF'],
'Form':['LIVERID','LIVERID'],
'Folder':['ALL','ALL'],
'Logline':['9','10'],
'Data':['Yes','Blank','No']})
I tried the code below but was not able to achieve the desired output.
res3.set_index(res3.groupby(level=0).cumcount(), append=True)['Data'].unstack(0)
Can anyone please help me?
I believe your best option is to create several dataframes that share the same column name (for example, three dataframes that each have a "Data" column) and then concatenate them:
import pandas as pd

# Values are wrapped in lists so that each frame has exactly one row.
df1 = pd.DataFrame({'Field': ['FAPERF'],
                    'Form': ['LIVERID'],
                    'Folder': ['ALL'],
                    'Logline': ['9'],
                    'Data': ['Yes']})
df2 = pd.DataFrame({'Data': ['No'],
                    'Logline': ['10']})
df3 = pd.DataFrame({'Data': ['Blank']})
frames = [df1, df2, df3]
result = pd.concat(frames)
You just need to build lists that specify the Logline and Data value for each row, create one dataframe per row, and concatenate them.
import pandas as pd

list_df = []
data_type_list = ["yes", "no", "Blank"]
logline_type = ["9", "10", "10"]
for x in range(len(data_type_list)):
    # One single-row frame per Data/Logline pair.
    new_dict = {'Field': ['FAPERF'], 'Form': ['LIVERID'], 'Folder': ['ALL'],
                'Data': [data_type_list[x]], 'Logline': [logline_type[x]]}
    df = pd.DataFrame(new_dict)
    list_df.append(df)
new_df = pd.concat(list_df)
print(new_df)

Is there a Python process for this R code

I am working in Python and want to group my data by two columns and, at the same time, add the missing dates between date1 (the date the event occurred) and date2 (a date that I choose), filling the missing values in the columns I select by forward fill.
I tried the code below in R and it works; I want to do the same in Python.
library(data.table)
library(padr)
library(dplyr)
data = fread("path", header = T)
data$ORDERDATE <- as.Date(data$ORDERDATE)
datemax = max(data$ORDERDATE)
data2 = data %>%
group_by(Column1, Column2) %>%
pad(.,group = c('Column1', 'Column2'), end_val = as.Date(datemax), interval = "day",break_above = 100000000000) %>%
tidyr::fill("Column3")
I searched for a Python equivalent of library(padr) but couldn't find one.
Thank you for answering my request.
As requested, here is an example table:
import pandas as pd
dates=['2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14']
users=['User1','User1','User2','User1','User2','User1','User2','User1','User2']
products=['product1','product1','product1','product1','product1','product2','product2','product2','product2']
quantities=[5,6,8,10,4,5,2,9,7]
prices=[2,2,5,5,6,6,6,7,7]
data = pd.DataFrame({'date':dates,'user':users,'product':products,'quantity':quantities,'price':prices})
data['date'] = pd.to_datetime(data.date, format='%Y-%m-%d')
data2=data.groupby(['user','product','date'],as_index=False).mean()
For User1 and product1, for example, I want to insert the missing dates, fill the quantity column with the value 0, and fill the price column with backward values over a date range that I choose.
I want to do the same per user and per product for the rest of my data.
The result should look like this:
[1]: https://i.stack.imgur.com/qOOda.png
The R code I used to generate the image is as follows:
library(padr)
library(dplyr)
dates=c('2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14')
users=c('User1','User1','User2','User1','User2','User1','User2','User1','User2')
products=c('product1','product1','product1','product1','product1','product2','product2','product2','product2')
quantities=c(5,6,8,10,4,5,2,9,7)
prices=c(2,2,5,5,6,6,6,7,7)
data=data.frame(date=dates,user=users,product=products,quantity=quantities,price=prices)
data$date <- as.Date(data$date)
datemax = max(data$date)
data2 = data %>% group_by(user, product) %>% pad(.,group = c('user', 'product'), end_val = as.Date(datemax), interval = "day",break_above = 100000000000)
data3=data2 %>% group_by(user,product,date) %>%
  summarize(quantity=sum(quantity),price=mean(price))
data4=data3%>% tidyr::fill("price")%>% fill_by_value(quantity, value = 0)
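One possible pandas counterpart to the padr approach, offered only as a sketch and not as a verified port of the R code: aggregate per day, then reindex each (user, product) group onto a daily date range that ends at the overall maximum date. It assumes that missing quantities become 0 and that prices are forward filled, as tidyr::fill does by default. The names padded and pad_group are made up for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["2014-01-14", "2014-01-14", "2014-01-15", "2014-01-19", "2014-01-18",
             "2014-01-25", "2014-01-28", "2014-02-05", "2014-02-14"],
    "user": ["User1", "User1", "User2", "User1", "User2",
             "User1", "User2", "User1", "User2"],
    "product": ["product1"] * 5 + ["product2"] * 4,
    "quantity": [5, 6, 8, 10, 4, 5, 2, 9, 7],
    "price": [2, 2, 5, 5, 6, 6, 6, 7, 7],
})
data["date"] = pd.to_datetime(data["date"], format="%Y-%m-%d")

# One row per user/product/day: total quantity and mean price.
daily = data.groupby(["user", "product", "date"], as_index=False).agg(
    quantity=("quantity", "sum"), price=("price", "mean")
)
end = daily["date"].max()  # plays the role of datemax in the R code

def pad_group(g):
    # Daily index from the group's first date to the common end date.
    idx = pd.date_range(g["date"].min(), end, freq="D")
    out = g.set_index("date")[["quantity", "price"]].reindex(idx)
    out["quantity"] = out["quantity"].fillna(0)  # no sale on inserted days
    out["price"] = out["price"].ffill()          # carry the last known price forward
    return out.rename_axis("date").reset_index()

parts = []
for (user, product), g in daily.groupby(["user", "product"]):
    p = pad_group(g)
    p.insert(0, "user", user)
    p.insert(1, "product", product)
    parts.append(p)

padded = pd.concat(parts, ignore_index=True)
print(padded.head(10))
```

The explicit loop over groups keeps the group keys easy to reattach; the product column is accessed with brackets throughout because DataFrame already has a method named product.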

Pandas: apply a specific function to columns and create column in new dataframe

I have a dataframe df1, like this:
date sentence
29/03/2019 i like you
.....
I want to create new dataframe df2 like this:
date verb object
29/03/2019 like you
....
using a function like this:
def getSplit(df1):
    verbList = []
    objList = []
    for row in df1['sentence']:
        verb = getVerb(row)
        obj = getObj(row)
        verbList.append(verb)
        objList.append(obj)
    df2 = df1[['date']].copy()
    df2['verb'] = verbList
    df2['object'] = objList
    return df2
My function runs fine, but it's slow. Could someone help me improve it so that it runs faster?
Thank you
You can use the pandas apply method to process the column:
def getverb(row):
    pass  # Your function

def getobj(row):
    pass  # Your function

df2 = df1.copy()  # Make a copy of your dataframe.
df2['verb'] = df2['sentence'].apply(getverb)
df2['obj'] = df2['sentence'].apply(getobj)
df2.drop('sentence', axis=1, inplace=True)  # Drop the sentence column
df2
I hope this helps.
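A runnable version of this idea, with placeholder extractors: the getverb and getobj bodies below are hypothetical stand-ins that simply pick the second and third word, since the asker's real functions are not shown. Note that Series.apply still calls the function once per row, so any gain over the explicit loop may be modest.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's functions (not shown in the question):
# they just pick the second and third whitespace-separated word.
def getverb(sentence):
    words = sentence.split()
    return words[1] if len(words) > 1 else None

def getobj(sentence):
    words = sentence.split()
    return words[2] if len(words) > 2 else None

df1 = pd.DataFrame({"date": ["29/03/2019"], "sentence": ["i like you"]})

# Keep only the date column, then add one derived column per function.
df2 = df1[["date"]].copy()
df2["verb"] = df1["sentence"].apply(getverb)
df2["object"] = df1["sentence"].apply(getobj)
print(df2)
```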
