Recursive groupby with quantiles - python

I have a dataframe of floats
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
1 0.433127 0.479051 0.159739 0.734577 0.113672
2 0.391228 0.516740 0.430628 0.586799 0.737838
3 0.956267 0.284201 0.648547 0.696216 0.292721
4 0.001490 0.973460 0.298401 0.313986 0.891711
5 0.585163 0.471310 0.773277 0.030346 0.706965
6 0.374244 0.090853 0.660500 0.931464 0.207191
7 0.630090 0.298163 0.741757 0.722165 0.218715
I can divide it into quantiles for a single column like so:
import numpy as np
import pandas as pd

def groupby_quantiles(df, column, groups: int):
    quantiles = df[column].quantile(np.linspace(0, 1, groups + 1))
    bins = pd.cut(df[column], quantiles, include_lowest=True)
    return df.groupby(bins)
>>> df.pipe(groupby_quantiles, "a", 2).apply(lambda x: print(x))
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
2 0.391228 0.516740 0.430628 0.586799 0.737838
4 0.001490 0.973460 0.298401 0.313986 0.891711
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
3 0.956267 0.284201 0.648547 0.696216 0.292721
5 0.585163 0.471310 0.773277 0.030346 0.706965
7 0.630090 0.298163 0.741757 0.722165 0.218715
Now, I want to repeat the same operation on each of the groups for the next column. The code becomes ridiculous
>>> (
    df
    .pipe(groupby_quantiles, "a", 2)
    .apply(
        lambda df_group: (
            df_group
            .pipe(groupby_quantiles, "b", 2)
            .apply(lambda x: print(x))
        )
    )
)
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
2 0.391228 0.51674 0.430628 0.586799 0.737838
4 0.001490 0.97346 0.298401 0.313986 0.891711
a b c d e
3 0.956267 0.284201 0.648547 0.696216 0.292721
7 0.630090 0.298163 0.741757 0.722165 0.218715
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
5 0.585163 0.471310 0.773277 0.030346 0.706965
My goal is to repeat this operation for as many columns as I want, then aggregate the groups at the end. Here's what the final function could look like, along with the desired result, assuming we aggregate with the mean.
>>> groupby_quantiles(df, columns=["a", "b"], groups=[2, 2], agg="mean")
a b c d e
0 0.229947 0.163832 0.730887 0.756813 0.150660
1 0.196359 0.745100 0.364515 0.450392 0.814774
2 0.793179 0.291182 0.695152 0.709190 0.255718
3 0.509145 0.475180 0.466508 0.382462 0.410319
Any ideas on how to achieve this?

Here is a way. First, using quantile then cut can be rewritten with qcut. Then use a recursive operation, similar to this:
def groupby_quantiles(df, cols, grs, agg_func):
    # to store all the results
    _dfs = []
    # recursive function
    def recurse(_df, depth):
        col = cols[depth]
        gr = grs[depth]
        # iterate over the groups per quantile
        for _, _dfgr in _df.groupby(pd.qcut(_df[col], gr)):
            if depth != -1:
                recurse(_dfgr, depth + 1)  # recurse if not at the last column
            else:
                _dfs.append(_dfgr.agg(agg_func))  # else perform the aggregation
    # a negative depth makes it easier to access the right column and quantile
    depth = -len(cols)
    recurse(df, depth)  # start the recursion
    return pd.concat(_dfs, axis=1).T  # concat the results and transpose

print(groupby_quantiles(df, cols=['a', 'b'], grs=[2, 2], agg_func='mean'))
# a b c d e
# 0 0.229946 0.163832 0.730887 0.756813 0.150660
# 1 0.196359 0.745100 0.364515 0.450392 0.814774
# 2 0.793179 0.291182 0.695152 0.709190 0.255718
# 3 0.509145 0.475181 0.466508 0.382462 0.410318
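For reference, the recursion can also be unrolled into a loop built on the same qcut idea: compute one label series per column, where each level's quantiles are computed inside the groups formed by the previous levels, then do a single groupby over all labels and aggregate once. A sketch (`groupby_quantiles_iter` is a hypothetical name, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0.085649, 0.433127, 0.391228, 0.956267, 0.001490, 0.585163, 0.374244, 0.630090],
    "b": [0.236811, 0.479051, 0.516740, 0.284201, 0.973460, 0.471310, 0.090853, 0.298163],
    "c": [0.801274, 0.159739, 0.430628, 0.648547, 0.298401, 0.773277, 0.660500, 0.741757],
    "d": [0.582162, 0.734577, 0.586799, 0.696216, 0.313986, 0.030346, 0.931464, 0.722165],
    "e": [0.094129, 0.113672, 0.737838, 0.292721, 0.891711, 0.706965, 0.207191, 0.218715],
})

def groupby_quantiles_iter(df, columns, groups, agg="mean"):
    keys = []
    for col, g in zip(columns, groups):
        if keys:
            # quantiles of this column computed within the groups
            # formed by the earlier label series (g bound early on purpose)
            labels = df.groupby(keys)[col].transform(
                lambda s, g=g: pd.qcut(s, g, labels=False))
        else:
            labels = pd.qcut(df[col], g, labels=False)
        keys.append(labels)
    # one groupby over all label series, then a single aggregation
    return df.groupby(keys).agg(agg).reset_index(drop=True)

result = groupby_quantiles_iter(df, ["a", "b"], [2, 2])
```

Because the groupby sorts by the label tuples, the rows come out in the same depth-first order as the recursive version.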

Related

Pandas: Groupby, concatenate one column and identify the row with maximums

I have a dataframe like this:
prefix input_text target_text score
X V A 1
X V B 2
X W C 1
X W B 3
I want to group them by some columns and concatenate the column target_text, while also getting the maximum of score in each group and identifying the target_text with the highest score, like this:
prefix input_text target_text score top
X V A, B 2 B
X W C, B 3 B
This is my code, which does the concatenation; however, I don't know how to do the rest.
df['target_text'] = (
    df[['prefix', 'target_text', 'input_text']]
    .groupby(['input_text', 'prefix'])['target_text']
    .transform(lambda x: '<br />'.join(x))
)
df = df.drop_duplicates(subset=['prefix', 'input_text', 'target_text'])
For the concatenation I use HTML to join them; if I could also bold the target with the highest score, that would be nice.
Let us try
df.sort_values('score', ascending=False)\
  .drop_duplicates(['prefix', 'input_text'])\
  .rename(columns={'target_text': 'top'})\
  .merge(df.groupby(['prefix', 'input_text'], as_index=False)['target_text'].agg(','.join))
Out[259]:
prefix input_text top score target_text
0 X W B 3 C,B
1 X V B 2 A,B
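For reference, here is a self-contained version of that chain with the question's sample data (a sketch; the merge result's column order differs from the question's layout):

```python
import pandas as pd

df = pd.DataFrame({'prefix': ['X', 'X', 'X', 'X'],
                   'input_text': ['V', 'V', 'W', 'W'],
                   'target_text': ['A', 'B', 'C', 'B'],
                   'score': [1, 2, 1, 3]})

# Highest-scoring row per group becomes 'top'; the joined targets are merged back in
out = (df.sort_values('score', ascending=False)
         .drop_duplicates(['prefix', 'input_text'])
         .rename(columns={'target_text': 'top'})
         .merge(df.groupby(['prefix', 'input_text'], as_index=False)['target_text']
                  .agg(','.join)))
```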
groupby agg would be useful here:
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
new_df['top'] = df.loc[new_df['top'], 'target_text'].values
new_df:
prefix input_text target_text score top
0 X V A, B 2 B
1 X W C, B 3 B
Aggregations are as follows:
target_text is joined together using ', '.join.
score is aggregated to keep only the max value with 'max'.
top is the idxmax of the score column.
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
prefix input_text target_text score top
0 X V A, B 2 1
1 X W C, B 3 3
The values in top are the corresponding indexes from df:
prefix input_text target_text score
0 X V A 1
1 X V B 2 # index 1
2 X W C 1
3 X W B 3 # index 3
These values need to be "looked up" from df:
df.loc[new_df['top'], 'target_text']
1 B
3 B
Name: target_text, dtype: object
And assigned back to new_df. values is needed to break the index alignment.
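A tiny illustration of why values matters here (toy frames, not the question's data): assigning a Series aligns on index labels, while assigning a plain array is positional.

```python
import pandas as pd

new_df = pd.DataFrame({'x': [10, 20]})           # index 0, 1
looked_up = pd.Series(['B', 'B'], index=[1, 3])  # indexes carried over from df

new_df['aligned'] = looked_up            # aligns on index: row 0 gets NaN
new_df['positional'] = looked_up.values  # positional: both rows get 'B'
```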
try via sort_values(), groupby() and agg():
out = (df.sort_values('score')
         .groupby(['prefix', 'input_text'], as_index=False)
         .agg(target_text=('target_text', ', '.join),
              score=('score', 'max'),
              top=('target_text', 'last')))
output of out:
input_text prefix score target_text top
0 V X 2 A, B B
1 W X 3 C, B B
Explanation:
We sort the values of 'score', then group by columns 'input_text' and 'prefix', aggregating as follows:
we join together the values of 'target_text' with ', '
we keep only the max value of the 'score' column, because we aggregate with max
we take the last value of the 'target_text' column; since we sorted by score first, aggregating last gives the target with the highest score
Update:
If you have many more columns to include, you can aggregate them if they are not high in number; otherwise:
newdf = df.sort_values('score', ascending=False).drop_duplicates(['prefix', 'input_text'], ignore_index=True)
# Finally join them
out = out.join(newdf[list of column names that you want])
# For example:
# out = out.join(newdf[['target_first', 'target_last']])

Groupby and select the first, second, and fourth member of each group?

Related: pandas dataframe groupby and get nth row
I can use the groupby method and select the first N number of group members with:
df.groupby('columnA').head(N)
But what if I want the first, second, and fourth members of each group?
GroupBy.nth takes a list, so you could just do
df = pd.DataFrame({'A': list('aaaabbbb'), 'B': list('abcdefgh')})
df.groupby('A').nth([0, 1, 3])
B
A
a a
a b
a d
b e
b f
b h
# To get the grouper as a column, use as_index=False
df.groupby('A', as_index=False).nth([0, 1, 3])
A B
0 a a
1 a b
3 a d
4 b e
5 b f
7 b h
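An equivalent filter can also be built with GroupBy.cumcount, which numbers rows within each group; a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaaabbbb'), 'B': list('abcdefgh')})

# Keep rows whose within-group position is 0, 1, or 3
out = df[df.groupby('A').cumcount().isin([0, 1, 3])]
```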
You can do
df.groupby('columnA').apply(lambda x: x.iloc[[0, 1, 3], :]).reset_index(level=0, drop=True)
df1 = df.groupby('columnA').head(4)
# head(4) keeps each group's first four rows; then drop each group's third
# member by position (an index-based drop would remove only a single row):
df1 = df1[df1.groupby('columnA').cumcount() != 2]

Pandas implode Dataframe with values separated by char

I was just wondering what the best approach is to implode a DataFrame with values separated by a given char.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode by #
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe create a copy of the original dataframe and process it column by column with pandas str.cat?
Thanks in advance!
Use DataFrame.agg with join, then convert Series to one row DataFrame with Series.to_frame and transpose by DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print(df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
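In case the imploded frame ever needs to be expanded back, splitting on the same separator recovers the original columns. A sketch, under the assumption that every cell holds the same number of parts:

```python
import pandas as pd

imploded = pd.DataFrame({'A': ['1#2#3'], 'B': ['z#x#c']})

# Split each one-row cell back into a column of values
restored = pd.DataFrame({c: imploded.loc[0, c].split('#')
                         for c in imploded.columns})
```

Note the values come back as strings, since the implode step went through astype(str); cast with astype again if numeric types are needed.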

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe, but only a part of the columns.
Somehow it doesn't work when trying to replace column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe:
Note: at the bottom there is copyable code to reproduce the data:
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually prepend the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
To overwrite column names you can use the .rename() method. It will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename:
mask = df.iloc[:, 1:4].columns
Then, use a list comprehension and a conditional to rename just the columns you want (slicing each name x, not str):
df.columns = [x if x not in mask else x[:4] for x in df.columns]
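A self-contained run of that mask approach against the reproduction frame (the slice has to be applied to each name x in the comprehension):

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

mask = df.iloc[:, 1:4].columns
# Truncate only the names that appear in the mask
df.columns = [x if x not in mask else x[:4] for x in df.columns]
```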

How to count the number of occurrences in either of two columns

I have a simple-looking problem. I have a dataframe df with two columns. For each string that occurs in either of these columns, I would like to count the number of rows which have the symbol in either column.
E.g.
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
The following code works but is very inefficient.
for elem in set(df.values.flat):
    print(elem, len(df.loc[(df[0] == elem) | (df[1] == elem)]))
a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1
This is, however, very inefficient, and my dataframe is large. The inefficiency comes from calling df.loc[(df[0] == elem) | (df[1] == elem)] separately for every distinct symbol in df.
Is there a fast way of doing this?
You can use loc to filter out row level matches from 'col2', append the filtered 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
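Note that Series.append was removed in pandas 2.0; the same idea can be written with pd.concat. A runnable sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('gacjdibdid'),
                   'col2': list('khieihbdah')})

# Rows where both columns hold the same symbol should only count once,
# so filter those rows out of col2 before counting
counts = pd.concat([df['col1'],
                    df.loc[df['col1'] != df['col2'], 'col2']]).value_counts()
```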
Timings
Using the following setup to produce a larger sample dataset:
from string import ascii_lowercase
n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
OK, this is much trickier than I thought. I'm not sure how this will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns from that result to generate a dummies df for all unique values; we can then use np.maximum on the 2 dfs and sum these:
In [77]:
t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
           df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
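reindex_axis has since been removed from pandas; the same dummies trick can be written with reindex(columns=...). A sketch with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('gacjdibdid'),
                   'col2': list('khieihbdah')})

vals = np.unique(df.values)
# One-hot encode each column, align both to the full symbol set, take the
# elementwise maximum so a row counts each symbol at most once, then sum
count = np.maximum(
    df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0),
    df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)).sum()
```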
