GroupBy aggregate function that computes two values at once - python

I have a dataframe like the following:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [4, 5, 6, 7, 8, 9],
})
Now I want to group and aggregate with two values being produced per group. The result should be similar to the following:
expected = df.groupby('A').agg([min, max])
#     B       C
#   min max min max
# A
# 1   1   3   4   6
# 2   4   6   7   9
However, in my case, instead of two distinct functions min and max, I have one function that computes these two values at once:
def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)
Now my question is, how can I use this one function to produce two aggregation values per group?
It's kind of related to this answer, but I couldn't figure out how to do it. The best I could come up with is a doubly-nested apply; however, this is not very elegant, and it also produces the MultiIndex on the rows rather than on the columns:
result = df.groupby('A').apply(
    lambda g: g.drop(columns='A').apply(
        lambda h: pd.Series(dict(zip(['min', 'max'], minmax(h))))
    )
)
#        B  C
# A
# 1 min  1  4
#   max  3  6
# 2 min  4  7
#   max  6  9
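For reference, the row MultiIndex from this workaround can at least be rotated into the expected column layout with unstack, though the nested apply itself still feels clumsy:
result.unstack()  # moves the inner (min/max) row level up into the columns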

If you're stuck with a function that returns a tuple of values, I'd:
Define a new function that wraps the tuple values into a dict, predefining the dict.keys() to align with what you want the column names to be.
Use a careful for loop that doesn't waste time and space.
Wrap Function
# Given Function
def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)

# wrapped function
def minmax_dict(x):
    return dict(zip(['min', 'max'], minmax(x)))
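As a quick sanity check of the wrapped function on one of the question's columns:
minmax_dict(df['B'])
# {'min': 1, 'max': 6}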
Careful for loop
I'm aiming to pass this dictionary into the pd.DataFrame constructor. That means I want the keys to be tuples of the MultiIndex column elements, and the values to be dictionaries whose keys are the index elements.
dat = {}
for a, d in df.set_index('A').groupby('A'):
    for cn, c in d.items():
        for k, v in minmax_dict(c).items():
            dat.setdefault((cn, k), {})[a] = v

pd.DataFrame(dat).rename_axis('A')
    B       C
  min max min max
A
1   1   3   4   6
2   4   6   7   9
Added Detail
Take a look at the crafted dictionary:
dat
{('B', 'min'): {1: 1, 2: 4},
 ('B', 'max'): {1: 3, 2: 6},
 ('C', 'min'): {1: 4, 2: 7},
 ('C', 'max'): {1: 6, 2: 9}}

One other solution:
pd.concat({k: d.agg(minmax).set_axis(['min', 'max'])
           for k, d in df.drop('A', axis=1).groupby(df['A'])})
Output:
       B  C
1 min  1  4
  max  3  6
2 min  4  7
  max  6  9
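A possible touch-up on the same idea: pd.concat accepts a names= argument for the new outer level, and an unstack then gives the column layout from the question:
out = pd.concat({k: d.agg(minmax).set_axis(['min', 'max'])
                 for k, d in df.drop('A', axis=1).groupby(df['A'])},
                names=['A'])
out.unstack()  # columns become (B, min) ... (C, max), index is A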

Related

Get the sum of a multikey dict by one key and add it to a dataframe column in Python?

I have a dataframe and a dict as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2], [4, 5]]), columns=['a', 'b'])
df
   a  b
0  1  2
1  4  5
dict
{(0, 'A', 1): 1, (0, 'A', 2): 2, (1, 'B', 1): 3, (1, 'B', 2): 4}
I am trying to get the total sum by the first key of the dict and add the result as a new column to my dataframe.
This is what I have so far, but I am thinking there must be a more efficient way to do this.
total_by_1st = {}
for (x, _, _), v in dict.items():
    if x in total_by_1st:
        total_by_1st[x] += v
    else:
        total_by_1st[x] = v
total_by_1st
{0: 3, 1: 7}
df['c'] = df.index.map(total_by_1st)
df
   a  b  c
0  1  2  3
1  4  5  7
I am trying to get the total sum by the first key of the dict and add the result as a new column to my dataframe
You can convert to series and sum on level 0:
df['new'] = pd.Series(d).sum(level=0)
print(df)
   a  b  new
0  1  2    3
1  4  5    7
Here d is the name of the variable which stores your dictionary. Please note that you should not name a variable the same as a builtin (use d or something similar instead of dict).
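On recent pandas versions, where Series.sum no longer accepts a level= argument, the equivalent groupby spelling would be:
df['new'] = pd.Series(d).groupby(level=0).sum()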

Conditional sum on multiple pandas groups, each defined by a set of overlapping column values

I'm trying to perform a conditional sum on groups of rows defined by a list of arbitrary column values. By conditional sum, I mean sum values in one column only if the value in a second column is above a threshold. There can be overlap among the groups and the number of elements in each group can be different.
For example, given the following dataframe:
data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'counter': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    'output': [5, 10, 15, 20, 25, 35, 20, 15, 10, 5]
}
df = pd.DataFrame(data)
>>> df
   id  counter  output
0   1       10       5
1   2        9      10
2   3        8      15
3   4        7      20
4   5        6      25
5   6        5      35
6   7        4      20
7   8        3      15
8   9        2      10
9  10        1       5
And the following inputs (I'm flexible if we need to change their format):
group_ids = {'Group A': [1, 2, 3, 4], 'Group B': [6, 7, 8, 9], 'Group C': [4, 5, 6]}
output_threshold = 12
I would like to generate the following new dataframe, which is the sum of counter for each group defined by the list of group_ids only if output exceeds the specified output_threshold. Bonus points if I can add the title to each of these groups:
  title  sum
Group A   15
Group B   12
Group C   18
You can use isin to check for values and sum:
mask = (df['output'] > output_threshold).astype(int)
for k, v in group_ids.items():
    df[k] = df['id'].isin(v) * mask * df['counter']

df[list(group_ids.keys())].sum()
Output (a Series, rather than the exact frame format you showed):
Group A    15
Group B    12
Group C    18
dtype: int64
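If you'd rather not add helper columns to df, the same arithmetic can be written as a comprehension (a minimal sketch of the same logic):
cond = df['output'] > output_threshold
pd.Series({k: df.loc[cond & df['id'].isin(v), 'counter'].sum()
           for k, v in group_ids.items()})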
Working within dictionaries and rebuilding the dataframe could work as well:
from collections import defaultdict
from itertools import product

# M is the dataframe as a list of row dicts
M = df.to_dict('records')

d = defaultdict(list)
# get the product of M and group_ids
for entry, groups in product(M, group_ids.items()):
    # pass in the condition
    if entry['id'] in groups[-1] and entry['output'] > output_threshold:
        # extract relevant counter value
        d[groups[0]].append(entry['counter'])
# sum the list values
d = {k: sum(v) for k, v in d.items()}
# create dataframe
res = pd.DataFrame.from_dict(d, orient='index', columns=['Total'])
res
         Total
Group A     15
Group C     18
Group B     12
The following is a solution to the question in this post, except that the id groups CANNOT be overlapping: the inverse_groups comprehension below keeps only the last group that contains each id (here, ids 4 and 6 land in Group C), which is why the Group A and B sums differ from the expected output.
You may want to create an inverse_group dictionary, map the id and groupby:
inverse_groups = {x: k for k, v in group_ids.items() for x in v}
(df[df['output'] > output_threshold]
   .groupby(df['id'].map(inverse_groups))
   .counter.sum()
)
Output:
id
Group A     8
Group B     7
Group C    18
Name: counter, dtype: int64
Full credit goes to @QuangHoang. This was his first reply to my question, which he completely re-wrote to satisfy the overlapping id condition.

How to join a dataframe and dictionary on two rows

I have a dictionary and a dataframe. The dictionary contains a mapping of one letter to one number and the dataframe has a row containing these specific letters and another row containing these specific numbers, adjacent to each other (not that it necessarily matters).
I want to update the row containing the numbers by matching each letter in the row of the dataframe with the letter in the dictionary and then replacing the corresponding number (number in the same column as the letter) with the value of that letter from the dictionary.
df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
dict = {'a':2, 'b':3, 'c':5}
Let's say dict is the dictionary and df is the dataframe I want the result to be df2.
df2 = pd.DataFrame(np.array([[2, 3, 5], ['a', 'b', 'c'], [7, 8, 9]]))
df
   0  1  2
0  4  5  6
1  a  b  c
2  7  8  9
dict
{'a': 2, 'b': 3, 'c': 5}
df2
   0  1  2
0  2  3  5
1  a  b  c
2  7  8  9
I do not know how to use merge or join to fix this; my initial thought was to make the dictionary a dataframe object, but I am not sure where to go from there.
It's a little weird, but:
df = pd.DataFrame(np.array([[4, 5, 6], ['a', 'b', 'c'], [7, 8, 9]]))
d = {'a': 2, 'b': 3, 'c': 5}
df.iloc[0] = df.iloc[1].map(lambda x: d.get(x, x))
df
#    0  1  2
# 0  2  3  5
# 1  a  b  c
# 2  7  8  9
I couldn't bring myself to redefine dict to be a particular dictionary. :D
After receiving a much-deserved smackdown regarding the speed of apply, I present to you the theoretically faster approach below:
df.iloc[0] = df.iloc[1].map(d).where(df.iloc[1].isin(d.keys()), df.iloc[0])
This gives you the dictionary value from d (df.iloc[1].map(d)) wherever the value in row 1 is among the keys of d (df.iloc[1].isin(d.keys())), and otherwise keeps the value already in row 0 (the df.iloc[0] fallback passed to .where).
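Another spelling of the same fallback behaviour, for what it's worth: Series.replace takes a dict and leaves non-matching values untouched, so this should be equivalent here:
df.iloc[0] = df.iloc[1].replace(d)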
Hope this helps!

create a top 10 list for multiple groups with a ranking in python [duplicate]

I have a pandas data frame that is composed of different subgroups.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01]
})
I want to find the rank of each id in its group with say, lower values being better. In the example above, in group A, Id 1 would have a rank of 1, Id 2 would have a rank of 4. In group B, Id 5 would have a rank of 2, Id 8 would have a rank of 1 and so on.
Right now I assess the ranks by:
Sorting by value.
df.sort_values('value', ascending=True, inplace=True)
Create a ranker function (it assumes variables already sorted)
def ranker(df):
    df['rank'] = np.arange(len(df)) + 1
    return df
Apply the ranker function on each group separately:
df = df.groupby(['group']).apply(ranker)
This process works, but it is really slow when I run it on millions of rows of data. Does anyone have any ideas on how to make a faster ranker function?
rank is cythonized, so it should be very fast, and you can pass the same options as df.rank(); here are the docs for rank. As you can see, tie-breaks can be handled in one of five different ways via the method argument.
It's also possible you simply want the .cumcount() of the group.
In [12]: df.groupby('group')['value'].rank(ascending=False)
Out[12]:
0    4
1    1
2    3
3    2
4    3
5    2
6    1
7    4
dtype: float64
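Given the question's convention (lower value = better rank, ties broken by row order like the asker's ranker), a sketch with ascending ranks:
df['rank'] = df.groupby('group')['value'].rank(method='first').astype(int)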
Working with a big DataFrame (13 million lines), the rank method with groupby maxed out my 8 GB of RAM and took a really long time. I found a workaround that is less greedy in memory, which I put here just in case:
# sort by group then value so row positions line up with the flattened ranks
df = df.sort_values(['group', 'value'])
tmp = df.groupby('group').size()
rank = tmp.map(range)
# flatten, shifting to 1-based ranks
rank = [item + 1 for sublist in rank for item in sublist]
df['rank'] = rank
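On recent pandas, the same 1-based positional rank can be had without the Python-level flattening, via the cumcount mentioned above:
df = df.sort_values(['group', 'value'])
df['rank'] = df.groupby('group').cumcount() + 1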

Pandas GroupBy Index

I have a dataframe with a column that I want to group by. Within each group, I want to perform a check to see if the first value is less than the second value times some scalar, e.g. (x < y * .5). If it is, the first value is set to True and all other values False. Else, all values are False.
I have a sample data frame here:
d = pd.DataFrame(np.array([[0, 0, 1, 1, 2, 2, 2],
                           [3, 4, 5, 6, 7, 8, 9],
                           [1.25, 10.1, 2.3, 2.4, 1.2, 5.5, 5.7]]).T,
                 columns=['a', 'b', 'c'])
I can get a stacked groupby to extract the data that I want:
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a')
This results in three groups, each with 2 entries. By adding an apply, I can call a function to return a boolean mask:
def func(group):
    if group.iloc[0] < group.iloc[1] * .5:
        return [True, False]
    else:
        return [False, False]
g = d.groupby('a')['c'].nsmallest(2).groupby(level='a').apply(func)
Unfortunately, this destroys the index into the original dataframe and removes the ability to handle cases where more than 2 elements are present.
Two questions:
Is it possible to maintain the index in the original dataframe and update a column with the results of a groupby? This is made slightly different because the .nsmallest call results in a Series on the 'c' column.
Does a more elegant method exist to compute a boolean array for groups in a dataframe based on some custom criteria, e.g. this ratio test?
Looks like transform is what you need:
>>> def func(group):
...     res = [False] * len(group)
...     if group.iloc[0] < group.iloc[1] * .5:
...         res[0] = True
...     return res
>>> d['res'] = d.groupby('a')['c'].transform(func).astype('bool')
>>> d
   a  b      c    res
0  0  3   1.25   True
1  0  4  10.10  False
2  1  5   2.30  False
3  1  6   2.40  False
4  2  7   1.20   True
5  2  8   5.50  False
6  2  9   5.70  False
From the documentation:
The transform method returns an object that is indexed the same (same
size) as the one being grouped. Thus, the passed transform function
should return a result that is the same size as the group chunk. For
example, suppose we wished to standardize the data within each group…
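To make that quoted example concrete, a small sketch of a standardizing transform on the sample frame (zscore is just an illustrative name):
def zscore(s):
    return (s - s.mean()) / s.std()

d.groupby('a')['c'].transform(zscore)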
