Recursive groupby with quantiles - python

I have a dataframe of floats
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
1 0.433127 0.479051 0.159739 0.734577 0.113672
2 0.391228 0.516740 0.430628 0.586799 0.737838
3 0.956267 0.284201 0.648547 0.696216 0.292721
4 0.001490 0.973460 0.298401 0.313986 0.891711
5 0.585163 0.471310 0.773277 0.030346 0.706965
6 0.374244 0.090853 0.660500 0.931464 0.207191
7 0.630090 0.298163 0.741757 0.722165 0.218715
I can divide it into quantiles for a single column like so:
import numpy as np
import pandas as pd

def groupby_quantiles(df, column, groups: int):
    quantiles = df[column].quantile(np.linspace(0, 1, groups + 1))
    bins = pd.cut(df[column], quantiles, include_lowest=True)
    return df.groupby(bins)
>>> df.pipe(groupby_quantiles, "a", 2).apply(lambda x: print(x))
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
2 0.391228 0.516740 0.430628 0.586799 0.737838
4 0.001490 0.973460 0.298401 0.313986 0.891711
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
3 0.956267 0.284201 0.648547 0.696216 0.292721
5 0.585163 0.471310 0.773277 0.030346 0.706965
7 0.630090 0.298163 0.741757 0.722165 0.218715
Now, I want to repeat the same operation on each of the groups for the next column. The code becomes ridiculous
>>> (
    df
    .pipe(groupby_quantiles, "a", 2)
    .apply(
        lambda df_group: (
            df_group
            .pipe(groupby_quantiles, "b", 2)
            .apply(lambda x: print(x))
        )
    )
)
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
2 0.391228 0.51674 0.430628 0.586799 0.737838
4 0.001490 0.97346 0.298401 0.313986 0.891711
a b c d e
3 0.956267 0.284201 0.648547 0.696216 0.292721
7 0.630090 0.298163 0.741757 0.722165 0.218715
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
5 0.585163 0.471310 0.773277 0.030346 0.706965
My goal is to repeat this operation for as many columns as I want, then aggregate the groups at the end. Here's what the final function could look like, along with the desired result, assuming we aggregate with the mean.
>>> groupby_quantiles(df, columns=["a", "b"], groups=[2, 2], agg="mean")
a b c d e
0 0.229947 0.163832 0.730887 0.756813 0.150660
1 0.196359 0.745100 0.364515 0.450392 0.814774
2 0.793179 0.291182 0.695152 0.709190 0.255718
3 0.509145 0.475180 0.466508 0.382462 0.410319
Any ideas on how to achieve this?

Here is a way. First, using quantile then cut can be rewritten with qcut. Then use a recursive operation, similar to this:
def groupby_quantiles(df, cols, grs, agg_func):
    # to store all the results
    _dfs = []
    # recursive function
    def recurse(_df, depth):
        col = cols[depth]
        gr = grs[depth]
        # iterate over the groups per quantile
        for _, _dfgr in _df.groupby(pd.qcut(_df[col], gr)):
            if depth != -1:
                recurse(_dfgr, depth + 1)  # recurse if not at the last column
            else:
                _dfs.append(_dfgr.agg(agg_func))  # else perform the aggregation
    # a negative depth makes it easier to access the right column and quantile
    depth = -len(cols)
    recurse(df, depth)  # start the recursion
    return pd.concat(_dfs, axis=1).T  # concat the results and transpose

print(groupby_quantiles(df, cols=['a', 'b'], grs=[2, 2], agg_func='mean'))
# a b c d e
# 0 0.229946 0.163832 0.730887 0.756813 0.150660
# 1 0.196359 0.745100 0.364515 0.450392 0.814774
# 2 0.793179 0.291182 0.695152 0.709190 0.255718
# 3 0.509145 0.475181 0.466508 0.382462 0.410318
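For reference, the recursion can also be unrolled into a loop built on the same qcut idea: compute one label series per column, where each level's quantiles are computed inside the groups formed by the previous levels, then do a single groupby over all labels and aggregate once. A sketch (`groupby_quantiles_iter` is a hypothetical name, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0.085649, 0.433127, 0.391228, 0.956267, 0.001490, 0.585163, 0.374244, 0.630090],
    "b": [0.236811, 0.479051, 0.516740, 0.284201, 0.973460, 0.471310, 0.090853, 0.298163],
    "c": [0.801274, 0.159739, 0.430628, 0.648547, 0.298401, 0.773277, 0.660500, 0.741757],
    "d": [0.582162, 0.734577, 0.586799, 0.696216, 0.313986, 0.030346, 0.931464, 0.722165],
    "e": [0.094129, 0.113672, 0.737838, 0.292721, 0.891711, 0.706965, 0.207191, 0.218715],
})

def groupby_quantiles_iter(df, columns, groups, agg="mean"):
    keys = []
    for col, g in zip(columns, groups):
        if keys:
            # quantiles of this column computed within the groups
            # formed by the earlier label series (g bound early on purpose)
            labels = df.groupby(keys)[col].transform(
                lambda s, g=g: pd.qcut(s, g, labels=False))
        else:
            labels = pd.qcut(df[col], g, labels=False)
        keys.append(labels)
    # one groupby over all label series, then a single aggregation
    return df.groupby(keys).agg(agg).reset_index(drop=True)

result = groupby_quantiles_iter(df, ["a", "b"], [2, 2])
```

Because the groupby sorts by the label tuples, the rows come out in the same depth-first order as the recursive version.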

Related

Pandas: Groupby, concatenate one column and identify the row with maximums

I have a dataframe like this:
prefix input_text target_text score
X V A 1
X V B 2
X W C 1
X W B 3
I want to group them by some columns and concatenate the column target_text, while also getting the maximum of score in each group and identifying the target_text with the highest score, like this:
prefix input_text target_text score top
X V A, B 2 B
X W C, B 3 B
This is my code, which does the concatenation; however, I don't know how to do the rest.
df['target_text'] = (
    df[['prefix', 'target_text', 'input_text']]
    .groupby(['input_text', 'prefix'])['target_text']
    .transform(lambda x: '<br />'.join(x))
)
df = df.drop_duplicates(subset=['prefix', 'input_text', 'target_text'])
For the concatenation I use HTML to join them; if I could also bold the target with the highest score, that would be nice.
Let us try
df.sort_values('score', ascending=False)\
  .drop_duplicates(['prefix', 'input_text'])\
  .rename(columns={'target_text': 'top'})\
  .merge(df.groupby(['prefix', 'input_text'], as_index=False)['target_text'].agg(','.join))
Out[259]:
prefix input_text top score target_text
0 X W B 3 C,B
1 X V B 2 A,B
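For reference, here is a self-contained version of that chain with the question's sample data (a sketch; the merge result's column order differs from the question's layout):

```python
import pandas as pd

df = pd.DataFrame({'prefix': ['X', 'X', 'X', 'X'],
                   'input_text': ['V', 'V', 'W', 'W'],
                   'target_text': ['A', 'B', 'C', 'B'],
                   'score': [1, 2, 1, 3]})

# Highest-scoring row per group becomes 'top'; the joined targets are merged back in
out = (df.sort_values('score', ascending=False)
         .drop_duplicates(['prefix', 'input_text'])
         .rename(columns={'target_text': 'top'})
         .merge(df.groupby(['prefix', 'input_text'], as_index=False)['target_text']
                  .agg(','.join)))
```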
groupby agg would be useful here:
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
new_df['top'] = df.loc[new_df['top'], 'target_text'].values
new_df:
prefix input_text target_text score top
0 X V A, B 2 B
1 X W C, B 3 B
Aggregations are as follows:
target_text is joined together using ', '.join.
score is aggregated to keep only the max value with 'max'.
top is the idxmax of the score column.
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
prefix input_text target_text score top
0 X V A, B 2 1
1 X W C, B 3 3
The values in top are the corresponding indexes from df:
prefix input_text target_text score
0 X V A 1
1 X V B 2 # index 1
2 X W C 1
3 X W B 3 # index 3
These values need to be "looked up" from df:
df.loc[new_df['top'], 'target_text']
1 B
3 B
Name: target_text, dtype: object
And assigned back to new_df. values is needed to break the index alignment.
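A tiny illustration of why values matters here (toy frames, not the question's data): assigning a Series aligns on index labels, while assigning a plain array is positional.

```python
import pandas as pd

new_df = pd.DataFrame({'x': [10, 20]})           # index 0, 1
looked_up = pd.Series(['B', 'B'], index=[1, 3])  # indexes carried over from df

new_df['aligned'] = looked_up            # aligns on index: row 0 gets NaN
new_df['positional'] = looked_up.values  # positional: both rows get 'B'
```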
try via sort_values(), groupby() and agg():
out = (df.sort_values('score')
         .groupby(['prefix', 'input_text'], as_index=False)
         .agg(target_text=('target_text', ', '.join),
              score=('score', 'max'),
              top=('target_text', 'last')))
output of out:
input_text prefix score target_text top
0 V X 2 A, B B
1 W X 3 C, B B
Explanation:
We sort the values of 'score', then group by columns 'input_text' and 'prefix', aggregating as follows:
we join together the values of 'target_text' with ', '
we keep only the max value of the 'score' column, because we aggregate with max
we take the last value of the 'target_text' column; since we sorted by score first, aggregating last gives the target with the highest score
Update:
If you have many more columns to include, you can aggregate them if they are not high in number; otherwise:
newdf = df.sort_values('score', ascending=False).drop_duplicates(['prefix', 'input_text'], ignore_index=True)
# Finally join them
out = out.join(newdf[list of column names that you want])
# For example:
# out = out.join(newdf[['target_first', 'target_last']])

Groupby and select the first, second, and fourth member of each group?

Related: pandas dataframe groupby and get nth row
I can use the groupby method and select the first N number of group members with:
df.groupby('columnA').head(N)
But what if I want the first, second, and fourth members of each group?
GroupBy.nth takes a list, so you could just do
df = pd.DataFrame({'A': list('aaaabbbb'), 'B': list('abcdefgh')})
df.groupby('A').nth([0, 1, 3])
B
A
a a
a b
a d
b e
b f
b h
# To get the grouper as a column, use as_index=False
df.groupby('A', as_index=False).nth([0, 1, 3])
A B
0 a a
1 a b
3 a d
4 b e
5 b f
7 b h
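An equivalent filter can also be built with GroupBy.cumcount, which numbers rows within each group; a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaaabbbb'), 'B': list('abcdefgh')})

# Keep rows whose within-group position is 0, 1, or 3
out = df[df.groupby('A').cumcount().isin([0, 1, 3])]
```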
You can do
df.groupby('columnA').apply(lambda x: x.iloc[[0, 1, 3], :]).reset_index(level=0, drop=True)
df1 = df.groupby('columnA').head(4)
# head(4) keeps each group's first four rows; then drop each group's third
# member by position (an index-based drop would remove only a single row):
df1 = df1[df1.groupby('columnA').cumcount() != 2]

Pandas implode Dataframe with values separated by char

I was just wondering what the best approach is to implode a DataFrame with values separated by a given char.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode by #
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe create a copy of the original dataframe and process it column by column with pandas str.cat?
Thanks in advance!
Use DataFrame.agg with join, then convert Series to one row DataFrame with Series.to_frame and transpose by DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print(df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
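In case the imploded frame ever needs to be expanded back, splitting on the same separator recovers the original columns. A sketch, under the assumption that every cell holds the same number of parts:

```python
import pandas as pd

imploded = pd.DataFrame({'A': ['1#2#3'], 'B': ['z#x#c']})

# Split each one-row cell back into a column of values
restored = pd.DataFrame({c: imploded.loc[0, c].split('#')
                         for c in imploded.columns})
```

Note the values come back as strings, since the implode step went through astype(str); cast with astype again if numeric types are needed.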

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe, but only a part of the columns.
Somehow it doesn't work when trying to replace column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe:
Note: at the bottom there is copyable code to reproduce the data:
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually prepend the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
To overwrite column names you can use the .rename() method. It will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename:
mask = df.iloc[:, 1:4].columns
Then, use a list comprehension and a conditional to rename just the columns you want (slicing each name x, not str):
df.columns = [x if x not in mask else x[:4] for x in df.columns]
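A self-contained run of that mask approach against the reproduction frame (the slice has to be applied to each name x in the comprehension):

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

mask = df.iloc[:, 1:4].columns
# Truncate only the names that appear in the mask
df.columns = [x if x not in mask else x[:4] for x in df.columns]
```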

How to count the number of occurrences in either of two columns

I have a simple-looking problem. I have a dataframe df with two columns. For each string that occurs in either of these columns, I would like to count the number of rows which have the symbol in either column.
E.g.
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
The following code works but is very inefficient.
for elem in set(df.values.flat):
    print(elem, len(df.loc[(df[0] == elem) | (df[1] == elem)]))
a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1
This is, however, very inefficient, and my dataframe is large. The inefficiency comes from calling df.loc[(df[0] == elem) | (df[1] == elem)] separately for every distinct symbol in df.
Is there a fast way of doing this?
You can use loc to filter out row level matches from 'col2', append the filtered 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
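Note that Series.append was removed in pandas 2.0; the same idea can be written with pd.concat. A runnable sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('gacjdibdid'),
                   'col2': list('khieihbdah')})

# Rows where both columns hold the same symbol should only count once,
# so filter those rows out of col2 before counting
counts = pd.concat([df['col1'],
                    df.loc[df['col1'] != df['col2'], 'col2']]).value_counts()
```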
Timings
Using the following setup to produce a larger sample dataset:
from string import ascii_lowercase
n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
OK, this is much trickier than I thought. I'm not sure how this will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns from that result to generate a dummies df for all unique values; we can then use np.maximum on the 2 dfs and sum these:
In [77]:
t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
           df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
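reindex_axis has since been removed from pandas; the same dummies trick can be written with reindex(columns=...). A sketch with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('gacjdibdid'),
                   'col2': list('khieihbdah')})

vals = np.unique(df.values)
# One-hot encode each column, align both to the full symbol set, take the
# elementwise maximum so a row counts each symbol at most once, then sum
count = np.maximum(
    df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0),
    df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)).sum()
```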
