Aggregate by percentile and count for groups in python

I'm a new Python user familiar with R.
I want to calculate user-defined quantiles for groups, complete with the count of observations in each group.
In R I would do:
df_sum <- df %>%
  group_by(group) %>%
  dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
                   n = n())
In Python I can get the grouped percentile with:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group  percentile  count
A      7.4         5
B      6.55        4

You can use pandas.DataFrame.agg() to apply multiple functions.
In this case you should use numpy.quantile().
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
# 85th percentile and group size in one pass; then rename the resulting columns
df_sum = df.groupby(['group'])['obsval'].agg([lambda x: np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
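If you'd rather not rename the columns afterwards, named aggregation (pandas 0.25+) does the same thing in one step; a minimal sketch of the same computation:
import pandas as pd

data = {'group': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'],
        'obsval': [1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)

# each keyword becomes an output column name
df_sum = df.groupby('group')['obsval'].agg(
    percentile=lambda x: x.quantile(0.85),
    count='count',
)
print(df_sum)
Note that Series.quantile and np.quantile both use linear interpolation by default, which is what produces 7.4 and 6.55 here; R's quantile(type = 8) uses a different rule, so the numbers will not match R exactly.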

Related

dplython: remove duplicates

Is there a way to remove duplicate rows for two specified columns using dplython?
This is an example of what I want to accomplish:
import pandas as pd
from dplython import *
data = {'store': [1, 1, 2, 2, 4, 4],
        'Type': ['A', 'A', 'A', 'B', 'B', 'B'],
        'weekly_sales': [100, 200, 300, 400, 500, 200]}
df = pd.DataFrame(data)
df.drop_duplicates(subset=["store", "Type"])
This is my dplython attempt:
df_R = DplyFrame(df)
df_R >> sift(drop_duplicates(subset=[X.store,X.Type]))
Thanks!
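No dplython-specific verb for this comes to mind; one workaround, sketched below on the assumption that a DplyFrame still behaves like a pandas DataFrame, is to call drop_duplicates directly and re-wrap the result so it can keep flowing through >> pipes:
import pandas as pd
from dplython import DplyFrame

data = {'store': [1, 1, 2, 2, 4, 4],
        'Type': ['A', 'A', 'A', 'B', 'B', 'B'],
        'weekly_sales': [100, 200, 300, 400, 500, 200]}
df_R = DplyFrame(pd.DataFrame(data))

# plain pandas call; re-wrap so the result stays a DplyFrame
df_R = DplyFrame(df_R.drop_duplicates(subset=['store', 'Type']))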

how to remove count from a plotly express bar chart hover data?

Given the following code:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1')
fig.show()
that generates the following bar plot:
how do I remove count from hover_data?
plotly==5.1.0
You can remove it by overriding the hovertemplate:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1')
fig.update_traces(hovertemplate='col1=%{y}<br><extra></extra>')
fig.show()
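If the goal is to suppress the hover box on those bars entirely rather than just the count, a small variation on the same idea (a sketch):
import pandas as pd
import plotly.express as px

d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)

fig = px.bar(df, y='col1', color='col1')
# hoverinfo='skip' disables hover for these traces; the hovertemplate must be
# cleared too, otherwise it takes precedence over hoverinfo
fig.update_traces(hoverinfo='skip', hovertemplate=None)
fig.show()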

dataframe pivoting and converting category values into columns with a prefix

I am trying to transform this DataFrame into the wide layout described below.
Here is the code to create the sample df:
df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])
symbol_list = [['A', 'B', ''], ['C', 'D', 'E']]
The end result dataframe is grouped by the id field, with the symbol column turned into multiple columns whose ordering is mapped to the user-input list.
I was using the .apply() method to construct each data row of the result, but it takes a very long time for 10,000+ data points.
I am trying to find a more efficient way to transform the dataframe. I am thinking I will need to use a pivot function to unstack the dataframe, combined with resetting the index (to turn the category values into columns). Appreciate any help on this!
Use GroupBy.cumcount with DataFrame.unstack for the reshape, then extract date with DataFrame.pop and take the max per row, flatten the columns, and finally add the date column back with DataFrame.assign:
df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])
# IMPORTANT: all values in column symbol appear in symbol_list (ignoring the empty strings)
symbol_list = [['A', 'B', ''], ['C','D','E']]
order = [y for x in symbol_list for y in x if y]
print (order)
['A', 'B', 'C', 'D', 'E']
# convert symbol to an ordered Categorical using the flattened list
df['symbol'] = pd.Categorical(df['symbol'], ordered=True, categories=order)
df['date'] = pd.to_datetime(df['date'])
# sort by id and symbol
df = df.sort_values(['id','symbol'])
df1 = df.set_index(['id',df.groupby('id').cumcount()]).unstack()
date_max = df1.pop('date').max(axis=1)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.assign(date = date_max)
print (df1)
symbol_0 symbol_1 symbol_2 value_0 value_1 value_2 date
id
1 A B NaN 0.0 1.0 NaN 2021-07-02
2 C D E 2.0 3.0 4.0 2021-07-02
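Since the question mentions wanting to use a pivot, here is a sketch of an equivalent reshape built on DataFrame.pivot instead of unstack (DataFrame.pivot accepts a list for values in pandas 1.1+, and the sketch makes the same assumption that every symbol value appears in symbol_list):
import pandas as pd

df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])

order = ['A', 'B', 'C', 'D', 'E']
df['symbol'] = pd.Categorical(df['symbol'], ordered=True, categories=order)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id', 'symbol'])

# the position of each row within its id group becomes the column suffix
df['pos'] = df.groupby('id').cumcount()
wide = df.pivot(index='id', columns='pos', values=['symbol', 'value'])
wide.columns = [f'{name}_{pos}' for name, pos in wide.columns]
wide['date'] = df.groupby('id')['date'].max()
print(wide)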

Pandas groupby multiple columns with values of unique groupings as their own column

Example DataFrame:
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})
I would like to return the counts grouped by ID, with a column for each unique value of Type holding the count of that Type for the group, plus a total count:
pd.DataFrame({'ID': [1, 2, 3], 'CountTypeA': [0, 2, 3], 'CountTypeB': [2, 1, 0], 'TotalCount': [2, 3, 3]})
Is there an easy way to do this using the groupby function in pandas?
For this you can use pandas.get_dummies, which converts a categorical variable into dummy/indicator variables (see the pandas reference).
Check if this meets your requirements:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})

# one indicator column per Type value
dummy_var = pd.get_dummies(df["Type"])
dummy_var.rename(columns={'a': 'CountTypeA', 'b': 'CountTypeB'}, inplace=True)

# sum the indicators per ID to get the per-type counts, then add the total
df1 = pd.concat([df['ID'], dummy_var], axis=1)
df_group1 = df1.groupby(by=['ID'], as_index=False).sum()
df_group1['TotalCount'] = df_group1['CountTypeA'] + df_group1['CountTypeB']
print(df_group1)
This will print the following result:
ID CountTypeA CountTypeB TotalCount
0 1 0 2 2
1 2 2 1 3
2 3 3 0 3
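A more compact alternative (just a sketch) is pd.crosstab, which counts the ID/Type combinations directly and can add the row totals via margins:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})

# cross-tabulate ID against Type; margins adds a per-row total column
out = pd.crosstab(df['ID'], df['Type'], margins=True, margins_name='TotalCount')
out = out.iloc[:-1]   # drop the grand-total row that margins also appends
out = out.rename(columns={'a': 'CountTypeA', 'b': 'CountTypeB'}).reset_index()
print(out)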

Why does np.where also seem to evaluate values that don't match the condition?

I'm trying to change the values of only certain values in a dataframe:
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a':2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr[x]), test.col2)
However, this doesn't seem to work because even though I'm looking only at the values in col1 that are 'a', the error says
KeyError: 'b'
This implies that it also looks at the rows of col1 with value 'b'. Why is this? And how do I fix it?
The error is originating from the test.col1.map(lambda x: dict_curr[x]) part. You look up the values from col1 in dict_curr, which only has an entry for 'a', not for 'b'.
You can also just index the dataframe:
test.loc[test.col1 == 'a', 'col2'] = 2
The problem is that when you call np.where, all of its arguments are evaluated first, and only then is the result selected based on the condition. So the dictionary is queried for 'b' and 'c' as well, even though those values will be discarded later. Probably the easiest fix is:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr.get(x, 0)), test.col2)
This will give the value 0 for keys not in the dictionary, but since it will be discarded later it does not matter which value you use.
Another easy way of getting the same result is:
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = test.apply(lambda x: dict_curr.get(x.col1, x.col2), axis=1)
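Another option that avoids the KeyError altogether (a sketch): Series.map with a dict returns NaN for keys that are missing instead of raising, and fillna then restores the original value for those rows:
import pandas as pd

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}

# rows whose col1 is not in dict_curr get NaN, then fall back to the old col2
test['col2'] = test['col1'].map(dict_curr).fillna(test['col2']).astype(int)
print(test)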
